Abstract. String barcoding is a recently introduced technique for genomic-based identification of
microorganisms. In this paper we describe the engineering of highly scalable algorithms for robust string 
barcoding. Our methods enable distinguisher selection based on whole genomic sequences of hundreds 
of microorganisms of up to bacterial size on a well-equipped workstation, and can be easily parallelized 
to further extend the applicability range to thousands of bacterial size genomes. Experimental results 
on both randomly generated and NCBI genomic data show that whole-genome based selection results 
in a number of distinguishers nearly matching the information theoretic lower bounds for the problem. 

1 Introduction 

String barcoding is a recently introduced technique for genomic-based identification of microorganisms such
as viruses or bacteria. The basic barcoding problem [15] is formulated as follows: given the genomic sequences
$g_1, \ldots, g_n$ of $n$ microorganisms, find a minimum number of strings $t_1, \ldots, t_k$ distinguishing these
genomic sequences, i.e., having the property that, for every $g_i \neq g_j$, there exists a string $t_l$ which is a
substring of $g_i$ or $g_j$, but not of both. A closely related formulation was independently proposed in [3], where
it is assumed that it is possible to detect not just the presence or absence of a distinguisher $t_l$, but also the
number of repetitions of $t_l$ as a substring, up to a threshold of $R > 0$. The formulation in [15], which we adopt
in this paper, corresponds to $R = 1$.

Identification is performed by spotting or synthesizing on a microarray the Watson-Crick complements
of the distinguisher strings $t_1, \ldots, t_k$, and then hybridizing to the array the fluorescently labeled DNA
extracted from the unknown microorganism. Under the assumption of perfect hybridization stringency, the
hybridization pattern can be viewed as a string of $k$ zeros and ones, referred to as the barcode of the
microorganism. By construction, the barcodes corresponding to the $n$ microorganisms are distinct, and thus
the barcode uniquely identifies any one of them. To improve identification robustness, one may also require
redundant distinguishability (i.e., at least $m$ different distinguishers for every pair of microorganisms, where
$m > 1$ is some fixed constant) and impose a lower bound on the edit distance between any pair of selected
distinguishers [15].
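
The barcode construction above can be illustrated with a minimal Python sketch. The function names and the toy sequences are assumptions made for illustration only; hybridization is idealized as exact substring containment, as in the $R = 1$ formulation.

```python
from itertools import combinations


def barcode(genome: str, distinguishers: list[str]) -> str:
    """Return the 0/1 barcode of `genome`: bit l is 1 iff distinguisher l occurs in it."""
    return "".join("1" if t in genome else "0" for t in distinguishers)


def is_valid_barcoding(genomes: list[str], distinguishers: list[str]) -> bool:
    """True iff every pair of genomes receives different barcodes."""
    codes = [barcode(g, distinguishers) for g in genomes]
    return all(a != b for a, b in combinations(codes, 2))


# Toy data (hypothetical, far shorter than real genomes).
genomes = ["acgtac", "ccgtta", "acgtta", "gggcgg"]
distinguishers = ["ac", "tt"]
codes = [barcode(g, distinguishers) for g in genomes]
```

In this toy instance two distinguishers suffice for four sequences, while the single string "ac" does not, because it leaves two pairs of sequences with identical barcodes.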

The algorithms previously proposed for string barcoding are based on integer programming [15], and
on Lagrangian relaxation and simulated annealing [3]. Unfortunately, the run-time of these algorithms does
not scale well with the number of microorganisms and the length of the genomic sequences; for example, the
largest instance sizes reported in [15] have a total genomic sequence length of around 100,000 bases.

In this paper we describe the engineering of highly scalable algorithms for robust string barcoding. Our 
methods enable distinguisher selection based on whole genomic sequences of hundreds of microorganisms 
of up to bacterial size on a well-equipped workstation, and can be easily parallelized to further extend the 
applicability range to thousands of bacterial size genomes. Whole-genome based selection is beneficial in at 
least two significant ways. First, it simplifies assay design since the DNA of the unknown pathogen can be 
amplified using inexpensive general-purpose whole-genome amplification methods such as specialized forms 
of degenerate primer multiplex PCR [5] or multiple displacement amplification [8]. Second, whole-genome 
based selection results in a reduced number of distinguishers, often very close to the information theoretic
lower bound of $\lceil \log_2 n \rceil$.
IIS-0346973. The work of IIM was supported in part by a "Large Grant" from the University of Connecticut's
Research Foundation.



Our algorithms are based on a simple greedy selection strategy: in every iteration we pick a substring that
distinguishes the largest number of not-yet-distinguished pairs of genomic sequences. This selection strategy
is an embodiment of the greedy setcover algorithm (see, e.g., [17]) for a problem instance with $O(n^2)$ elements
corresponding to the pairs of sequences. Hence, by a classical result of [6, 11, 13], our algorithm guarantees
an approximation factor of $2 \ln n$ for the barcoding problem. Very recently, Berman et al. [1] have shown that
no approximation algorithm can guarantee a factor of $(1 - \epsilon) \ln n$ unless NP = DTIME($n^{\log \log n}$), and also
proposed an information content greedy heuristic achieving an approximation factor of $1 + \ln n$. Experimental
results given in Section 5 show that our setcover greedy algorithm produces solutions of virtually identical
quality to those obtained by the information content heuristic.

The setcover greedy algorithm is extremely versatile, and can be easily extended to handle redundancy
and minimum edit distance constraints, as well as other biochemical constraints on individual distinguisher
sequences. Furthermore, unlike the information content heuristic of [1], the greedy setcover algorithm can
also take into account genomic sequence uncertainties expressed in the form of degenerate bases. Although
degenerate bases are ubiquitous in genomic databases, previous works have not recognized the need to
properly handle them. For example, experiments in [15] have implicitly treated degenerate bases in the input
genomic sequences as distinct nucleotides; under this approach a substring of degenerate nucleotides such as
NNNNN might be erroneously selected as a distinguisher although it encodes any possible substring of
length 5.

To achieve high scalability, our implementation relies on several techniques. First, we use an incremental
algorithm for quickly generating a representative set of candidate distinguishers and collecting all their
occurrences in the given genomic sequences. To reduce the number of candidates, we avoid generating any
substring that appears in all genomic sequences, which typically eliminates very short candidates. For each
genomic sequence, we also generate only one of the substrings that appear exclusively in that sequence; this
optimization eliminates from consideration most candidate distinguishers above a certain length. Unlike the
suffix tree method proposed by Rash and Gusfield [15], our approach may generate multiple candidates that
appear in the same set of $k$ genomic sequences (for $1 < k < n$). However, the penalty of having to evaluate
redundant candidates in the candidate selection phase is offset in practice by the faster candidate generation
time. Finally, the efficient implementation of the greedy selection phase of the algorithm combines a partition
based method for computing the coverage gain of candidate distinguishers (this method was first proposed
in the context of the information content heuristic in [1]) with a "lazy" strategy for updating coverage gains.

The rest of the paper is organized as follows. In Section 2 we give formal problem formulations and review 
previous work. In Section 3 we describe the efficient implementation of the setcover greedy algorithm for the 
basic string barcoding problem. In Section 4 we discuss the modifications required in the implementation for 
handling degenerate bases in input genomic sequences, redundancy and edit distance constraints, as well as 
biochemical constraints such as constraints on melting temperature and GC-content. In Section 5 we give the 
results of a comprehensive experimental study comparing, on both randomly generated and genomic data, 
our setcover greedy algorithms with other scalable methods including the information content heuristic and 
a recent set multicover randomized rounding approximation algorithm. 

2 Preliminaries and Problem Formulation 

Let $\Sigma = \{a, c, g, t\}$ be the DNA alphabet, and $\Sigma^*$ be the set of strings over $\Sigma$. A degenerate base is a
non-empty subset of $\Sigma$. We identify degenerate bases of cardinality 1 with the respective non-degenerate bases.
Given a DNA string $x = x_1 \ldots x_k \in \Sigma^*$ and a string of degenerate bases $y = y_1 \ldots y_n$, $n \ge k$, we say that

- $x$ has a perfect match at position $i$ of $y$ iff $y_{i+j-1} = \{x_j\}$ for every $1 \le j \le k$,

- $x$ has a perfect mismatch at position $i$ of $y$ iff there exists $1 \le j \le k$ such that $\{x_j\} \not\subseteq y_{i+j-1}$, and

- $x$ has an uncertain match at position $i$ of $y$ iff $\{x_j\} \subseteq y_{i+j-1}$ for every $1 \le j \le k$, but
  $y_{i+j-1} \neq \{x_j\}$ for at least one $j$.

String $x = x_1 \ldots x_k \in \Sigma^*$ distinguishes two sequences of degenerate bases $y$ and $z$ iff (a) $x$ has a perfect
match at one or more positions of $y$, and has perfect mismatches at all positions of $z$, or, symmetrically,
(b) $x$ has a perfect match at one or more positions of $z$, and has perfect mismatches at all positions of $y$.
The robust string barcoding problem with degenerate bases is formulated as follows: Given sequences of
degenerate bases $g_1, \ldots, g_n$ and redundancy threshold $m$, find a minimum number of strings $t_1, \ldots, t_k \in \Sigma^*$
such that, for every $i \neq j$, there exist $m$ distinct strings $t_l$ distinguishing $g_i$ and $g_j$.
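
The three match types and the resulting notion of distinguishing can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the paper: the helper names are hypothetical, positions are 0-based, and only a small part of the IUPAC ambiguity code table is included.

```python
from enum import Enum


class Match(Enum):
    PERFECT = "perfect match"
    UNCERTAIN = "uncertain match"
    MISMATCH = "perfect mismatch"


# Partial IUPAC table (assumption: only these codes occur in the toy input).
IUPAC = {"a": "a", "c": "c", "g": "g", "t": "t", "r": "ag", "y": "ct", "n": "acgt"}


def parse(seq: str) -> list[frozenset[str]]:
    """Turn a string of IUPAC codes into a list of degenerate bases (sets)."""
    return [frozenset(IUPAC[ch]) for ch in seq.lower()]


def classify(x: str, y: list[frozenset[str]], i: int) -> Match:
    """Classify x against y at 0-based position i (requires i + len(x) <= len(y))."""
    window = y[i:i + len(x)]
    if any(xj not in yj for xj, yj in zip(x, window)):
        return Match.MISMATCH          # some x_j is excluded by y
    if all(len(yj) == 1 for yj in window):
        return Match.PERFECT           # every y position equals {x_j}
    return Match.UNCERTAIN             # compatible, but some position is degenerate


def distinguishes(x: str, y: list[frozenset[str]], z: list[frozenset[str]]) -> bool:
    """x distinguishes y and z iff one has a perfect match and the other only mismatches."""
    def kinds(s):
        return {classify(x, s, i) for i in range(len(s) - len(x) + 1)}
    ky, kz = kinds(y), kinds(z)
    return (Match.PERFECT in ky and kz <= {Match.MISMATCH}) or \
           (Match.PERFECT in kz and ky <= {Match.MISMATCH})
```

Under these definitions a candidate aligned against a run of N bases yields only uncertain matches, so it is never counted as distinguishing that sequence from another.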

It is easy to see that, for $m = 1$, at least $\lceil \log_2 n \rceil$ distinguishers are needed to distinguish any $n$ genomic
sequences. However, achieving this lower bound requires distinguishers that have perfect matches in nearly
half of the sequences. In practice, additional constraints, such as lower bounds on the length of distinguishers,
may result in no string having perfect matches in a large number of sequences, and therefore much more
than a logarithmic number of distinguishers. The next theorem, the proof of which we omit due to space
constraints, establishes under a simple probabilistic model that there is an abundance of distinguishers
perfectly matching at least a constant fraction of the input sequences.
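
The counting argument behind the lower bound can be written out as a short worked derivation; the numeric instance $n = 1000$ is chosen here only because it matches the largest random instances used in Section 5.

$$
k \text{ distinguishers} \;\Rightarrow\; \text{at most } 2^k \text{ distinct barcodes} \;\Rightarrow\; 2^k \ge n \;\Rightarrow\; k \ge \lceil \log_2 n \rceil .
$$

For $n = 1000$ this gives $k \ge \lceil \log_2 1000 \rceil = 10$, since $2^9 = 512 < 1000 \le 1024 = 2^{10}$.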

Theorem 1. Consider a randomly generated instance of the string barcoding problem over a fixed alphabet
$\Sigma$ in which there are $n$ strings, each string $s = s_0 s_1 \ldots s_{\ell-1}$ is of length exactly $\ell$ generated independently
randomly with $\Pr[s_i = a] = 1/|\Sigma|$ for any $i$ and any $a \in \Sigma$. Also assume that $\ell$ is sufficiently large compared
to $n$. Then, for a random string $x \in \Sigma^*$ of length $O(\log \ell)$, the expected number of the input strings which
contain $x$ as a substring is $pn$ for some constant $0 < p < 1$.

Proof. Assume $n$ and $\ell$ to be sufficiently large for asymptotic results and $\sigma = |\Sigma| > 1$ to be fixed. It
suffices to show that for a random string $x \in \Sigma^*$ of length $k = O(\log \ell)$, $\Pr[x \text{ is a substring of } s] = p$ for
some constant $0 < p < 1$, where $s$ is any one of the $n$ input strings. In [14, Examples 6.4, 6.7, 6.8, 9.3 and
10.11], Odlyzko uses the bounds and generating function described in [10] to give asymptotic bounds on
$\Pr[x \text{ is a substring of } s]$ when $\sigma = 2$. The result can be generalized to the case of any fixed $\sigma > 2$ as follows.
For a fixed $x = x_1 x_2 \cdots x_k$, define the correlation polynomial $C_x(z)$ of $x$ as $C_x(z) = \sum_{j=0}^{k-1} c_x(j) z^j$, where
$c_x(0) = 1$ and, for $1 \le j < k$,

$$c_x(j) = \begin{cases} 1 & \text{if } x_1 x_2 \cdots x_{k-j} = x_{j+1} x_{j+2} \cdots x_k \\ 0 & \text{otherwise.} \end{cases}$$

Let $f_x(\ell)$ be the number of strings in $\Sigma^*$ of length $\ell$ that do not contain $x$ as a substring and
$F_x(z) = \sum_{\ell=0}^{\infty} f_x(\ell) z^{\ell}$ be the generating function for this number. Then,
$F_x(z) = \frac{C_x(z)}{z^k + (1 - \sigma z) C_x(z)}$. From this, it follows

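that $\Pr[x \prec s] = 1 - e^{-\ell / (\sigma^k C_x(1/\sigma)) + O(\ell k \sigma^{-2k})}$ for all sufficiently large $n$, $k$ and $\ell$, where
$x \prec s$ denotes that $x$ is a substring of $s$ and $e$ is the base of the natural logarithm. Note that
$1 \le C_x(1/\sigma) \le 2$ and, for a specific $x$, $C_x(1/\sigma)$ can be calculated exactly.
Now, setting $k = \Theta(\log_\sigma \ell)$ gives $\Pr[x \text{ is a substring of } s] = p$ for some constant $0 < p < 1$. □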
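
The quantities used in the proof can be checked on small cases with the following Python sketch. It assumes the conventions $C_x(z) = \sum_{j=0}^{k-1} c_x(j) z^j$ and $F_x(z) = \sum_{\ell \ge 0} f_x(\ell) z^{\ell}$, and verifies by brute force the identity $F_x(z)\,(z^k + (1 - \sigma z) C_x(z)) = C_x(z)$ coefficient by coefficient. The function names are hypothetical, and the brute-force count is only practical for very short strings.

```python
from fractions import Fraction
from itertools import product


def correlation_coefficients(x: str) -> list[int]:
    """c_x(j) = 1 iff the prefix of x of length k-j equals its suffix of length k-j."""
    k = len(x)
    return [1 if x[:k - j] == x[j:] else 0 for j in range(k)]


def correlation_value(x: str, sigma: int) -> Fraction:
    """Evaluate C_x(z) at z = 1/sigma exactly."""
    coeffs = correlation_coefficients(x)
    return sum((Fraction(c, sigma ** j) for j, c in enumerate(coeffs)), Fraction(0))


def count_avoiding(x: str, alphabet: str, length: int) -> int:
    """Brute-force f_x(length): number of strings over `alphabet` not containing x."""
    return sum(1 for s in product(alphabet, repeat=length) if x not in "".join(s))


def gf_identity_holds(x: str, alphabet: str, max_len: int) -> bool:
    """Check F_x(z) * (z^k + (1 - sigma z) C_x(z)) = C_x(z) up to degree max_len."""
    k, sigma = len(x), len(alphabet)
    c = correlation_coefficients(x)
    f = [count_avoiding(x, alphabet, l) for l in range(max_len + 1)]

    def fv(l: int) -> int:
        return f[l] if l >= 0 else 0

    for m in range(max_len + 1):
        lhs = fv(m - k) + sum(c[j] * (fv(m - j) - sigma * fv(m - j - 1)) for j in range(k))
        rhs = c[m] if m < k else 0
        if lhs != rhs:
            return False
    return True
```

For example, the string "aaa" overlaps itself at every shift, so over a four-letter alphabet its correlation value is $1 + 1/4 + 1/16 = 21/16$, while a string with no self-overlap such as "acgt" has correlation value exactly 1.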

Previous work. The robust string barcoding problem was introduced (for the case when genomic sequences 
contain no degenerate bases) by Rash and Gusfield [15]; they provided some experimental results based on 
integer programming methods, and left open the exact complexity and approximability of this problem. 
The problem without redundancy constraints was independently considered by Borneman et al. [3], who also
considered non-binary distinguishability (based on detecting the multiplicity of a distinguisher as a substring) 
and a slightly more general problem in which the objective is to pick a given number of distinguishers 
maximizing the number of distinguished pairs. The main motivation for the formulations in [3] comes from 
minimizing the number of oligonucleotide probes needed for analyzing populations of ribosomal RNA gene 
(rDNA) clones by hybridization experiments on DNA microarrays. Borneman et al. provided computational 
results using Lagrangian relaxation and simulated annealing techniques, and noted that the problem is 
NP-hard assuming that the lengths of the sequences in the prespecified set were unrestricted. Very recently, 
Berman, DasGupta and Kao [1] considered a general framework for test set problems that captured the string 
barcoding problem and its variations; their main contribution is to establish theoretically matching lower 
and upper bounds on the worst-case approximation ratio. Cazalis et al. [4] have independently investigated 
similar greedy distinguisher selection strategies for string barcoding. Unlike our work, the algorithms in [4] 
consider only a small random subset of the possible distinguishers and also prescribe their length in order to 
achieve practical running time. 

3 Efficient Implementation of the Greedy Setcover Algorithm 

In this section we present the implementation of the setcover greedy algorithm in the context of the basic
string barcoding problem, i.e., we disregard redundancy constraints and the presence of degenerate bases in
the input sequences. Implementation modifications needed to handle the robust barcoding problem in its full
generality are discussed in Section 4.

Input: Set $C$ of candidate distinguishers
Output: Set $D$ of selected distinguishers

$D \leftarrow \emptyset$; For every $c \in C$, $\Delta_{old}(c) \leftarrow \infty$
Repeat
    $\Delta^* \leftarrow 0$
    For every $c \in C$ with $\Delta_{old}(c) > \Delta^*$ do  // Since $\Delta(c, D) \le \Delta_{old}(c)$, $c$ can be ignored if $\Delta_{old}(c) \le \Delta^*$
        $\Delta_{old}(c) \leftarrow \Delta(c, D)$
        If $\Delta(c, D) > \Delta^*$ then $\Delta^* \leftarrow \Delta(c, D)$; $c^* \leftarrow c$
    If $\Delta^* > 0$ then $D \leftarrow D \cup \{c^*\}$
While $\Delta^* > 0$

Fig. 1. The greedy candidate selection algorithm
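
The lazy selection loop of Fig. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: candidates are given as a hypothetical dictionary mapping each candidate string to the set of indices of sequences it perfectly matches, and the gain is computed by a plain scan over the not-yet-distinguished pairs rather than by the faster methods described later in this section.

```python
from itertools import combinations


def lazy_greedy(matches: dict[str, frozenset[int]], n: int) -> list[str]:
    """Greedy setcover selection with lazy gain updates, following Fig. 1."""
    undistinguished = set(combinations(range(n), 2))

    def gain(c: str) -> int:
        # Number of still-undistinguished pairs split by candidate c.
        p = matches[c]
        return sum(1 for s, t in undistinguished if (s in p) != (t in p))

    old_gain = {c: float("inf") for c in matches}
    selected = []
    while True:
        best_gain, best = 0, None
        for c in matches:
            if old_gain[c] <= best_gain:
                continue  # gains never increase, so the stale value rules c out
            old_gain[c] = gain(c)
            if old_gain[c] > best_gain:
                best_gain, best = old_gain[c], c
        if best is None:
            break  # no candidate distinguishes any remaining pair
        selected.append(best)
        p = matches[best]
        undistinguished = {(s, t) for s, t in undistinguished if (s in p) == (t in p)}
    return selected


# Toy instance (hypothetical match lists for 4 sequences).
toy_matches = {
    "ac": frozenset({0, 2}),
    "tt": frozenset({1, 2}),
    "g": frozenset({0, 1, 2, 3}),
    "cg": frozenset({0}),
}
toy_selection = lazy_greedy(toy_matches, 4)
```

In the second iteration of the toy run, the candidate "g" is skipped without recomputing its gain, because its stored gain of 0 cannot exceed the best gain found so far.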

Our implementation of the setcover greedy algorithm has two main phases: a candidate generation phase
and a candidate selection phase. In the candidate generation phase a representative set of candidate
distinguishers is generated from the given genomic sequences. For each generated candidate, we also compute the
list of sequences with which the candidate has perfect matches; this information is needed in the candidate
selection phase. To reduce the number of candidates, we avoid generating any substring that appears in all
genomic sequences, which typically eliminates very short candidates. For each genomic sequence, we also
make sure to generate only one of the substrings that appear exclusively in that sequence; this optimization
eliminates from consideration most candidate distinguishers above a certain length. Unlike the suffix tree
method proposed by Rash and Gusfield [15], our approach may generate multiple candidates that appear in
the same set of $k$ genomic sequences (for $1 < k < n$). However, the penalty of having to evaluate redundant
candidates in the candidate selection phase is offset in practice by the faster candidate generation time.

Efficient implementation of the above candidate elimination rules is achieved by generating candidates
in increasing order of length and using exact match positions for candidates of length $l - 1$ when generating
candidates of length $l$. For each position $p$ in the input genomic sequences, we also maintain a flag to
indicate whether or not the algorithm should evaluate candidate substrings starting at $p$. The possible values
for the flag are TRUE (the substring of current length starting at $p$ is a possible candidate), FALSE (we
have already saved the substring of current length starting at $p$ as a candidate), or DONE (all candidates
containing as prefix the substring of current length starting at $p$ are redundant, i.e., the position can be
skipped for all remaining candidate lengths). Initially all flags are set to TRUE. The FALSE flags are reset
to TRUE whenever we increment the candidate length; however, we never reset DONE flags. For every
candidate length $l$, candidate evaluation proceeds sequentially over all positions of the genomic sequences.
Whenever we reach a position $p$ whose flag is set to TRUE, we use the list of matches for the substring of
length $l - 1$ starting at $p$ (or a linear time string matching algorithm if $l$ is the minimum candidate length)
to determine the list of matches for the substring of length $l$ starting at $p$, and set the flag to FALSE for
all positions where these matches occur. If the substring of length $l$ starting at $p$ has matches only within
the source sequence, and we have already generated a "unique" candidate for this sequence, we discard the
candidate and set the flag of $p$ to DONE.
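
The two candidate elimination rules can be illustrated with the simplified Python sketch below. It deliberately does not reproduce the incremental, flag-based algorithm described above: for clarity it rebuilds a substring-to-sequences map from scratch for every length, which is far slower and more memory hungry. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def generate_candidates(genomes: list[str], min_len: int, max_len: int) -> dict[str, frozenset[int]]:
    """Return candidate -> set of sequence indices containing it, after elimination."""
    n = len(genomes)
    candidates: dict[str, frozenset[int]] = {}
    has_unique = [False] * n  # a "unique" candidate was already kept for sequence i
    for length in range(min_len, max_len + 1):
        occurrences: dict[str, set[int]] = defaultdict(set)
        for i, g in enumerate(genomes):
            for p in range(len(g) - length + 1):
                occurrences[g[p:p + length]].add(i)
        for sub, seqs in occurrences.items():
            if len(seqs) == n:
                continue  # rule 1: appears in all sequences, distinguishes nothing
            if len(seqs) == 1:
                (i,) = seqs
                if has_unique[i]:
                    continue  # rule 2: keep only one substring exclusive to sequence i
                has_unique[i] = True
            candidates[sub] = frozenset(seqs)
    return candidates


# Toy input (hypothetical): three very short sequences, candidate lengths 1 and 2.
toy_candidates = generate_candidates(["acg", "acc", "tcg"], 1, 2)
```

In the toy run, "c" is dropped because it occurs in every sequence, and "tc" is dropped because "t" was already kept as the unique candidate for the third sequence.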

A further speed-up technique is to generate candidate distinguishers from a strict subset of the input
sequences. Although this speed-up can potentially affect solution quality, the results in Section 5 show that
the solution quality loss for whole-genome barcoding is minimal, even when we generate candidates based
on a single input sequence, which corresponds to pre-assigning a barcode of all 1's to this sequence.

After the set of candidates is generated we select the final set of distinguishers in the greedy phase of the
algorithm (Figure 1). We start with an empty set of distinguishers $D$. While there are pairs of sequences that
are not yet distinguished by $D$, we loop over all candidates and compute for each candidate $c$ the number
$\Delta(c, D)$ of pairs of sequences that are distinguished by $c$ but not by $D$, then add the candidate $c$ with largest
$\Delta$ value to $D$. Two sequences $s$ and $s'$ are distinguished by a candidate $c$ iff exactly one of $s$ and $s'$ appears in
the list $P_c$ of perfect matches of $c$, which is available from the candidate generation phase. A simple method
for computing $\Delta$ values is to maintain an $n \times n$ symmetric matrix indicating which of the pairs of sequences
are already distinguished, and then to probe the $|P_c| \cdot (n - |P_c|)$ entries in this matrix corresponding to pairs
$(s, s')$ with $s \in P_c$ and $s' \notin P_c$ when computing $\Delta(c, D)$. A more efficient method is based on maintaining
the partition defined on the set of sequences by $D$. If the partition defined by $D$ consists of sets $S_1, \ldots, S_k$,
then we can compute $\Delta(c, D)$ in $O(k + |P_c|) = O(n)$ time using the observation that

$$\Delta(c, D) = \sum_{i=1}^{k} |S_i \cap P_c| \cdot |S_i \setminus P_c| \qquad (1)$$

In addition to the fast partition based computation, our implementation of the greedy selection phase uses a
lazy strategy for updating the $\Delta$ values, based on the observation that they are monotonically non-increasing
during the algorithm (see Figure 1).
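
The partition based gain computation can be sketched in Python as follows. The idea is that two sequences are still undistinguished exactly when they lie in the same class of the partition induced by the selected distinguishers, so a candidate gains one pair for every choice of a matched and an unmatched sequence within the same class. The data layout (`class_of`, `class_size`) and the function names are assumptions made for illustration.

```python
def partition_gain(class_of: list[int], class_size: list[int], p_c: list[int]) -> int:
    """Sum over classes S_i of |S_i ∩ P_c| * |S_i minus P_c|, visiting only classes that meet P_c."""
    hits: dict[int, int] = {}
    for s in p_c:
        hits[class_of[s]] = hits.get(class_of[s], 0) + 1
    return sum(h * (class_size[i] - h) for i, h in hits.items())


def refine(class_of: list[int], class_size: list[int], p_c: list[int]) -> None:
    """After selecting c, split every class into its matched and unmatched parts (in place)."""
    new_id: dict[int, int] = {}
    for s in p_c:
        old = class_of[s]
        if old not in new_id:
            new_id[old] = len(class_size)  # fresh class for the matched part
            class_size.append(0)
        class_of[s] = new_id[old]
        class_size[new_id[old]] += 1
        class_size[old] -= 1
```

Both functions run in time proportional to $|P_c|$ plus the number of classes touched. A class whose members all match the selected candidate is left behind with size 0, which is harmless because the gain computation only visits classes that contain a matched sequence.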

4 Extended Barcoding Requirements 

In this section we describe the modifications needed to the basic implementation given in the previous section
when handling practical extensions of the barcoding problem.

Degenerate bases. In the presence of degenerate bases in the input genomic sequences, the hybridization of
a particular distinguisher may depend on which bases are actually present at positions with degeneracy greater
than 1. The greedy setcover algorithm takes into account this possibility for uncertain hybridization by
counting a pair $(g, g')$ as distinguished by a candidate $c$ if and only if $c$ has a perfect match with one and only
perfect mismatches with the other. For each generated candidate, in addition to the list of sequences that
have only perfect matches we also save a list containing all sequences with at least one uncertain match. This
allows fast computation of the (typically much longer) list of sequences having only perfect mismatches. To
avoid generating candidate distinguishers containing degenerate bases, we set the DONE flag as soon as the
corresponding substring extends past a degenerate base. Finally, since the partition of genomic sequences is
no longer defined in the presence of uncertain hybridization, formula (1) is no longer applicable and we have
to use the $n \times n$ "distinguished so far" matrix for computing $\Delta$ values.

Biochemical constraints on individual distinguishers. Since selected distinguishers must hybridize 
under the same experimental conditions, in practice it is natural to impose a variety of constraints on 
individual distinguishers, such as minimum and maximum length, GC content, melting temperature, etc. 
Furthermore, we may want to avoid using as distinguishers strings which appear in other organisms that may 
contaminate the sample. All individual constraints are easily incorporated as a simple filter in the candidate 
generation phase. 
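
A candidate filter of this kind might look like the following Python sketch. All thresholds are placeholders rather than values prescribed by the paper (only the length range 15 to 40 echoes the experiments in Section 5), and the melting temperature uses the simple Wallace rule, $2(A+T) + 4(G+C)$ degrees Celsius, which is only a rough estimate for short oligonucleotides; a real assay design would likely use a nearest-neighbor model instead.

```python
def gc_content(s: str) -> float:
    """Fraction of g/c bases in s."""
    return sum(ch in "gc" for ch in s.lower()) / len(s)


def wallace_tm(s: str) -> int:
    """Rough melting temperature (degrees C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    s = s.lower()
    return 2 * (s.count("a") + s.count("t")) + 4 * (s.count("g") + s.count("c"))


def passes_filter(
    s: str,
    min_len: int = 15,
    max_len: int = 40,
    gc_range: tuple[float, float] = (0.4, 0.6),   # placeholder bounds
    tm_range: tuple[int, int] = (50, 120),        # placeholder bounds
    contaminants: tuple[str, ...] = (),           # genomes the candidate must not occur in
) -> bool:
    """Return True iff candidate s satisfies all individual constraints."""
    if not (min_len <= len(s) <= max_len):
        return False
    if not (gc_range[0] <= gc_content(s) <= gc_range[1]):
        return False
    if not (tm_range[0] <= wallace_tm(s) <= tm_range[1]):
        return False
    return not any(s in genome for genome in contaminants)
```

Because each test looks at a single candidate in isolation, the filter can be applied as candidates are generated, before their match lists are stored.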

Redundancy constraints and minimum edit distance constraints. In practice, robust identification
requires redundant distinguishability, i.e., more than one distinguisher distinguishing any given pair of
genomic sequences. One may also impose a lower bound on the edit distance between any pair of selected
distinguishers [15]. Taking into account redundancy requirements is done by maintaining the number of
times each pair of genomic sequences has been distinguished. In order to incorporate the minimum edit
distance constraint, after selecting a distinguisher we eliminate from consideration all candidates that are
within an edit distance smaller than the given threshold.
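
Both extensions can be sketched in a few lines of Python. The sketch assumes a hypothetical dictionary `times` that maps each pair of sequence indices to the number of selected distinguishers separating it, and uses the standard dynamic programming recurrence for the Levenshtein edit distance; neither detail is taken from the paper's implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def prune_close(candidates: list[str], chosen: str, min_dist: int) -> list[str]:
    """Drop every candidate whose edit distance to the chosen distinguisher is below min_dist."""
    return [c for c in candidates if edit_distance(c, chosen) >= min_dist]


def redundant_gain(p_c: set[int], times: dict[tuple[int, int], int], m: int) -> int:
    """Count pairs separated by the candidate that are distinguished fewer than m times so far."""
    return sum(1 for (s, t), cnt in times.items()
               if cnt < m and (s in p_c) != (t in p_c))
```

With redundancy threshold $m$, a pair keeps contributing to the gain until it has been distinguished $m$ times, so the greedy loop continues past the point where all barcodes are merely distinct.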

5 Experimental Results 

We performed experiments on both randomly generated instances and whole microbial genomes extracted
from the NCBI databases [7]. Random testcases were generated from the uniform distribution induced by
assigning equal probabilities to each of the four nucleotides; these testcases do not contain any nucleotides
with degeneracy greater than 1. The NCBI testcase represents a selection of 29 complete microbial sequences,
varying in length between 490,000 and 4,750,000 bases (over 76 million bases in total). All experiments were
run on a PowerEdge 2600 Linux server with 4 GB of RAM and dual 2.8 GHz Intel Xeon CPUs, only one
of which is used by our sequential algorithms.

5.1 Algorithm Scalability 

As described in Section 3, there are two main phases in the algorithm: candidate distinguisher generation,
and greedy candidate selection. Figure 2 gives the average candidate selection CPU time for n random
sequences of length 10,000 and redundancy 1, averaged over 10 instances of each size. Combining the two
speed-up techniques for this phase (partition based coverage gain computation and lazy update of candidate
gains) results in over two orders of magnitude reductions in runtime.

Fig. 2. Candidate selection CPU time (in seconds) for n random sequences of length 10,000 and redundancy 1,
averaged over 10 instances of each size.

Table 1. Average statistics for instances with 1,000 random sequences of length 10,000, redundancy 1, and number
of source sequences varying from 1,000 down to 1.

#Source Seq.      1000         50           10           5            4            3            2            1
#Candidates       7213568.8    1438645.4    402700.5     225842.7     186871.3     146054.3     102800.6     55738.9
#Matches          55696250.9   35246584.9   23162063.6   18384468.1   16898179.9   15037610.7   12532936.0   8741587.0
Gen. time         132.3        44.7         35.5         31.4         31.3         30.6         28.1         24.9
Selection time    31.7         10.7         5.3          3.6          3.4          3.1          2.3          1.6
#Distinguishers   14.1         14.1         14.1         14.1         14.0         14.1         14.2         14.5

As mentioned in Section 3, a further speed-up technique is to generate candidate distinguishers only
from a small number of "source" input sequences. Table 1 gives the average number of candidates, number
of matches, runtimes for candidate generation and greedy selection, and number of selected distinguishers
for instances with 1,000 random sequences of length 10,000 and redundancy 1, when the number of source
sequences is varied from 1,000 down to 1 (the source sequences were chosen at random). Although this
speed-up can potentially affect solution quality, we found that on large instances the solution quality loss is minimal
even when we generate candidates based on a single input sequence; this case corresponds to pre-assigning a
barcode of all 1's to the source sequence. The technique significantly reduces both the memory requirement
(which is proportional to the number of candidates and the number of times they match input sequences)
and the runtime required for candidate generation and greedy selection.

5.2 Solution Quality on Random Data 

Table 2 gives the number of distinguishers returned by the setcover greedy algorithm for redundancy varying
between 1 and 20 on between 10 and 1,000 random sequences of length 10,000. For comparison we include in
the table the results obtained by the information content heuristic of [1], as well as the information
theoretic lower bound of $\lceil \log_2 n \rceil$ for the case when the redundancy requirement is 1. We note that the
number of distinguishers returned by the setcover greedy algorithm is virtually identical to that returned
by the information content heuristic, despite the latter having a better approximation guarantee [1].
Furthermore, the results for redundancy one are within 50% of the information theoretic lower bound for
the range of instance sizes considered in this experiment. The gap between the solutions returned by the
algorithms and the lower bound does increase with the number of sequences; however, it is not clear how
much of this increase is caused by degrading algorithm solution quality, and how much by degrading lower
bound quality.

We also compared our setcover greedy algorithm with a recently proposed multi-step rounding algorithm 
for set multicover [2]. The rounding algorithm has the following steps: 

1. Solve the fractional relaxation of the natural integer program formulation of the problem [15] (we used the
commercial solver CPLEX 9.0 for implementing this step)

2. Scale the fractional solution by an appropriate constant factor (see [2] for details)

3. Deterministically select all distinguishers with a scaled fractional value exceeding 1

4. Randomly select a subset of the remaining candidates, each candidate being chosen with a probability 
equal to the scaled fractional value 

5. Finally, if the selected set of distinguishers is not yet feasible, add further distinguishers using the setcover 
greedy algorithm 

The approximation guarantee established in [2] for the general set multicover problem translates into an
approximation factor of $2 \ln n - \ln r$ for robust string barcoding with redundancy $r$, which suggests that the
multi-step rounding algorithm is likely to improve upon the setcover greedy for high redundancy constraints.
Table 3 gives the results of experiments comparing the setcover greedy and multi-step rounding algorithms
on testcases consisting of up to 200 random sequences, each of length 1,000, for redundancy requirement
ranging from 1 to 100. The results confirm that the multi-step rounding algorithm has better solution
quality than setcover greedy when the redundancy requirement is very large relative to the number of sequences
(entries typeset in boldface), yet the setcover greedy still has the best performance for most combinations of
parameters.

5.3 Experiments on Genomic Data 

We ran our algorithm on a set of 29 complete microbial genomic sequences extracted from NCBI databases
[7]. Sequence lengths in the set vary between 490 Kbases and 4.75 Mbases, with an average length of 2.6
Mbases (over 76 Mbases total). Unlike random testcases, the sequences in the NCBI data set contain a small
number of degenerate bases, 861 bases in total. Therefore, we cannot use the partition method for computing
the number of sequence pairs distinguished by a candidate in the greedy selection phase, and we have to use
the slower matrix data structure. In these experiments we varied the redundancy requirement from 1 to 20. To
see the effect of length and edit distance requirements on the number of distinguishers, for each redundancy
requirement we computed both an unconstrained solution, and a solution in which distinguishers must have
length between 15 and 40, and there should be a minimum edit distance of 6 between every two selected
distinguishers (these values are similar to those used in [15]). In all experiments, we generated candidates
based only on the shortest sequence of 490 Kbases.

The results on the NCBI dataset are given in Table 4. Naturally, meeting higher redundancy constraints
requires more distinguishers to be selected. Additional length and edit distance constraints further increase
the number of distinguishers, but the latter is still within reasonable limits. The length constraints reduce the
number of candidates (from 1,775,471 to 122,478), which, for low redundancy values, has the effect of reducing
greedy selection time. However, for high redundancy requirements the reduction in number of candidates is
offset by the increase in solution size, and greedy selection becomes more time consuming with length and
edit distance constraints than without (selection time grows roughly linearly with solution size).

Table 2. Number of distinguishers returned by the setcover greedy algorithm (SGA) for varying redundancy and
number of sequences. For each value of n we report the average over 10 testcases, each consisting of n random sequences
of length 10,000. For comparison we include information content heuristic results (ICH) and the information theoretic
lower bound of $\lceil \log_2 n \rceil$ for redundancy 1 (LB).

Table 3. Number of distinguishers returned by the setcover greedy algorithm (SGA) and the multi-step rounding
algorithm in [2] (RND) for varying redundancy and number of sequences. For each value of n we report the average
over 10 testcases, each consisting of n random sequences of length 1,000. Boldface entries correspond to instances for
which the multi-step rounding algorithm has better solution quality than setcover greedy.

Table 4. Results on a set of 29 NCBI complete microbial genomes. Candidate generation time is approximately 335
seconds for all combinations of parameters.

6 Conclusions 

In this paper we have given highly scalable algorithms for the robust string barcoding problem, and have
shown that distinguisher selection based on whole genomic sequences results in a number of distinguishers nearly
matching the information theoretic lower bounds for the problem.

In ongoing work we are exploring heuristics and approximation algorithms for several extensions of the
string barcoding problem. First, we are considering the use of probe mixtures as distinguishers. With most
microarray technologies it is feasible to spot/synthesize a mixture of oligonucleotides at any given microarray
location. The DNA of a pathogen will hybridize to such a location if it contains at least one substring which
is the Watson-Crick complement of one of the oligonucleotides in the mixture. Using oligonucleotide mixtures
as distinguishers can reduce the number of spots on the array (and therefore barcode length) closer to the
information theoretic lower bound of $\log_2 n$. The reduction promises to be particularly significant when
reliable hybridization requires relatively long distinguishers; in these cases even the optimum barcoding
length is far from $\log_2 n$ [15]. A special case of this approach is the use of degenerate distinguishers similar to
the degenerate primers that have been recently employed in multiplex PCR amplification [12, 16]. Degenerate
distinguishers are particularly attractive for string barcoding since their synthesis cost is nearly identical to
the synthesis cost of a single non-degenerate distinguisher (synthesis requires the same number of steps; the
only difference is that multiple nucleotides must be added in some of the synthesis steps).

In many practical pathogen identification applications collected biological samples may contain the DNA 
of multiple pathogens. This issue is considered to be particularly significant in medical diagnosis applications, 
see, e.g., [9] for studies in detecting more than one HPV (human papillomavirus) genotype with varying rate
of multiple HPV infections carried by the same HPV carrier. In future work we plan to develop extensions 
of the barcoding technique that can reliably detect multiple pathogens for a given bound on the number of 
pathogens present. 